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Abstract. In the past decade computational biology has grown from 
a cottage industry with a handful of researchers to an attractive in- 
terdisciplinary field, catching the attention and imagination of many 
quantitatively-minded scientists. Of interest to us is the key role played 
by the EM algorithm during this transformation. We survey the use of 
the EM algorithm in a few important computational biology problems 
surrounding the "central dogma" of molecular biology: from DNA to 
RNA and then to proteins. Topics of this article include sequence motif 
discovery, protein sequence alignment, population genetics, evolution- 
ary models and mRNA expression microarray data analysis. 

Key words and phrases: EM algorithm, computational biology, liter- 
ature review. 
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1. INTRODUCTION 

1.1 Computational Biology 

Started by a few quantitatively minded biologists 
and biologically minded mathematicians in the 1970s, 
computational biology has been transformed in the 
past decades to an attractive interdisciplinary field 
drawing in many scientists. The use of formal statis- 
tical modeling and computational tools, the expecta- 
tion-maximization (EM) algorithm, in particular, 
contributed significantly to this dramatic transition 
in solving several key computational biology prob- 
lems. Our goal here is to review some of the histor- 
ical developments with technical details, illustrat- 
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ing how biology, traditionally regarded as an empir- 
ical science, has come to embrace rigorous statistical 
modeling and mathematical reasoning. 

Before getting into details of various applications 
of the EM algorithm in computational biology, we 
first explain some basic concepts of molecular biol- 
ogy. Three kinds of chain biopolymers are the cen- 
tral molecular building blocks of life: DNA, RNA 
and proteins. The DNA molecule is a double-stranded 
long sequence composed of four types of nucleotides 
(A, C, G and T). It has the famous double-helix 
structure, and stores the hereditary information. RNA 
molecules are very similar to DNAs, composed also 
of four nucleotides (A, C, G and U). Proteins are 
chains of 20 different basic units, called amino acids. 

The genome of an organism generally refers to the 
collection of all its DNA molecules, called the chro- 
mosomes. Each chromosome contains both the pro- 
tein (or RNA) coding regions, called genes, and non- 
coding regions. The percentage of the coding regions 
varies a lot among genomes of different species. For 
example, the coding regions of the genome of baker's 
yeast are more than 50%, whereas those of the hu- 
man genome are less than 3%. 

RNAs are classified into many types, and the three 
most basic types are as follows: messenger RNA 
(mRNA), transfer RNA (tRNA) and ribosomal RNA 
(rRNA). An mRNA can be viewed as an intermedi- 
ate copy of its corresponding gene and is used as a 



1 



2 



X. FAN, Y. YUAN AND J. S. LIU 



template for constructing the target protein. tRNA 
is needed to recruit various amino acids and trans- 
port them to the template mRNA. mRNA, tRNA 
and amino acids work together with the construc- 
tion machineries called ribosomes to make the final 
product, protein. One of the main components of 
ribosomes is the third kind of RNA, rRNA. 

Proteins carry out almost all essential functions in 
a cell, such as catalysation, signal transduction, gene 
regulation, molecular modification, etc. These capa- 
bilities of the protein molecules are dependent of 
their 3-dimensional shapes, which, to a large extent, 
are uniquely determined by their one-dimensional 
sequence compositions. In order to make a protein, 
the corresponding gene has to be transcribed into 
mRNA, and then the mRNA is translated into the 
protein. The "central dogma" refers to the concerted 
effort of transcription and translation of the cell. 
The expression level of a gene refers to the amount 
of its mRNA in the cell. 

Differences between two living organisms are most- 
ly due to the differences in their genomes. Within 
a multicellular organism, however, different cells may 
differ greatly in both physiology and function even 
though they all carry identical genomic information. 
These differences are the result of differential gene 
expression. Since the mid-1990s, scientists have de- 
veloped microarray techniques that can monitor si- 
multaneously the expression levels of all the genes in 
a cell, making it possible to construct the molecular 
"signature" of different cell types. These techniques 
can be used to study how a cell responds to differ- 
ent interventions, and to decipher gene regulatory 
networks. A more detailed introduction of the ba- 
sic biology for statisticians is given by Ji and Wong 
(2006). 

With the help of the recent biotechnology revolu- 
tion, biologists have generated an enormous amount 
of molecular data, such as billions of base pairs of 
DNA sequence data in the GenBank, protein struc- 
ture data in PDB, gene expression data, biological 
pathway data, biopolymer interaction data, etc. The 
explosive growth of various system-level molecular 
data calls for sophisticated statistical models for in- 
formation integration and for efficient computational 
algorithms. Meanwhile, statisticians have acquired 
a diverse array of tools for developing such models 
and algorithms, such as the EM algorithm (Demp- 
ster, Laird and Rubin (1977)), data augmentation 
(Tanner and Wong (1987)), Gibbs sampling (Ge- 
man and Geman (1984)), the Metropolis-Hastings 



algorithm (Metropolis and Ulam (1949); Metropolis 
et al. (1953); Hastings (1970)), etc. 

1.2 The Expectation-Maximization Algorithm 

The expectation-maximization (EM) algorithm 
(Dempster, Laird and Rubin, 1977) is an iterative 
method for finding the mode of a marginal likeli- 
hood function (e.g., the MLE when there is missing 
data) or a marginal distribution (e.g., the maximum 
a posteriori estimator). Let Y denote the observed 
data, the parameters of interest, and T the nui- 
sance parameters or missing data. The goal is to 
maximize the function 

p(Y\@) = J p(Y,T\@)dT, 

which cannot be solved analytically. A basic assump- 
tion underlying the effectiveness of the EM algo- 
rithm is that the complete-data likelihood or the 
posterior distribution, p(Y,T\@), is easy to deal 
with. Starting with a crude parameter estimate ©(°) , 
the algorithm iterates between the following Expec- 
tation (E-step) and Maximization (M-step) steps 
until convergence: 

• E-step: Compute the Q-function: 

Q(©|0(*)) = £; r|0(4)iY [iogp(Y,r|©)]. 

• M-step: Finding the maximand: 

0( m > = argmaxQ(0|0W). 

Unlike the Newton-Rap hson and scoring algorithms, 
the EM algorithm does not require computing the 
second derivative or the Hessian matrix. The EM 
algorithm also has the nice properties of monotone 
nondecreasing in the marginal likelihood and sta- 
ble convergence to a local mode (or a saddle point) 
under weak conditions. More importantly, the EM 
algorithm is constructed based on the missing data 
formulation and often conveys useful statistical in- 
sights regarding the underlying statistical model. A 
major drawback of the EM algorithm is that its con- 
vergence rate is only linear, proportional to the frac- 
tion of "missing information" about (Dempster, 
Laird and Rubin (1977)). In cases with a large pro- 
portion of missing information, the convergence rate 
of the EM algorithm can be very slow. To monitor 
the convergence rate and the local mode problem, a 
basic strategy is to start the EM algorithm with mul- 
tiple initial values. More sophisticated methods are 
available for specific problems, such as the "backup- 
buffering" strategy in Qin, Niu and Liu (2002). 
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1.3 Uses of the EM Algorithm in Biology 

The idea of iterating between filling in the miss- 
ing data and estimating unknown parameters is so 
intuitive that some special forms of the EM algo- 
rithm appeared in the literature long before Demp- 
ster, Laird and Rubin (1977) defined it. The earliest 
example on record is by McKendrick (1926), who in- 
vented a special EM algorithm for fitting a Poisson 
model to a cholera infection data set. Other early 
forms of the EM algorithm appeared in numerous 
genetics studies involving allele frequency estima- 
tion, segregation analysis and pedigree data anal- 
ysis (Ceppellini, Siniscalco and Smith, 1955; Smith, 
1957; Ott, 1979). A precursor to the broad recog- 
nition of the EM algorithm by the computational 
biology community is Churchill (1989), who applied 
the EM algorithm to fit a hidden Markov model 
(HMM) for partitioning genomic sequences into re- 
gions with homogenous base compositions. Lawrence 
and Reilly (1990) first introduced the EM algorithm 
for biological sequence motif discovery. Haussler et al. 

(1993) and Krogh et al. (1994) formulated an inno- 
vative HMM and used the EM algorithm for pro- 
tein sequence alignment. Krogh, Mian and Haussler 

(1994) extended these algorithms to predict genes 
in E. coli DNA data. During the past two decades, 
probabilistic modeling and the EM algorithm have 
become a more and more common practice in com- 
putational biology, ranging from multiple sequence 
alignment for a single protein family (Do et al., 2005) 
to genome- wide predictions of protein-protein inter- 
actions (Deng et al., 2002), and to single-nucleotide 
polymorphism (SNP) haplotype estimation (Kang 
et al. (2004)). 

As noted in Meng and Pedlow (1992) and Meng 
(1997), there are too many EM-related papers to 
track. This is true even within the field of computa- 
tional biology. In this paper we only examine a few 
key topics in computational biology and use typical 
examples to show how the EM algorithm has paved 
the road for these studies. The connection between 
the EM algorithm and statistical modeling of com- 
plex systems is essential in computational biology. 
It is our hope that this brief survey will stimulate 
further EM applications and provide insight for the 
development of new algorithms. 

Discrete sequence data and continuous expression 
data are two of the most common data types in com- 
putational biology. We discuss sequence data analy- 
sis in Sections 2-5, and gene expression data analysis 



in Section 6. A main objective of computational bi- 
ology research surrounding the "central dogma" is 
to study how the gene sequences affect the gene ex- 
pression. In Section 2 we attempt to find conserved 
patterns in functionally related gene sequences as 
an effort to explain the relationship of their gene 
expression. In Section 3 we give an EM algorithm 
for multiple sequence alignment, where the goal is to 
establish "relatedness" of different sequences. Based 
on the alignment of evolutionary related DNA se- 
quences, another EM algorithm for detecting po- 
tentially expression-related regions is introduced in 
Section 4. An alternative way to deduce the relation- 
ship between gene sequence and gene expression is to 
check the effect of sequence variation within the pop- 
ulation of a species. In Section 5 we provide an EM 
algorithm to deal with this type of small sequence 
variation. In Section 6 we review the clustering anal- 
ysis of microarray gene-expression data, which is 
important for connecting the phenotype variation 
among individuals with the expression level varia- 
tion. Finally, in Section 7 we discuss trends in com- 
putational biology research. 

2. SEQUENCE MOTIF DISCOVERY AND 
GENE REGULATION 

In order for a gene to be transcribed, special pro- 
teins called transcription factors (TFs) are often re- 
quired to bind to certain sequences, called transcrip- 
tion factor binding sites (TFBSs). These sites are 
usually 6-20 bp long and are mostly located up- 
stream of the gene. One TF is usually involved in the 
regulation of many genes, and the TFBSs that the 
TF recognizes often exhibit strong sequence speci- 
ficity and conservation (e.g., the first position of 
the TFBSs is likely T, etc.). This specific pattern 
is called a TF binding motif (TFBM). For example, 
Figure 1 shows a motif of length 6. The motif is rep- 
resented by the position-specific frequency matrix 
(#1, . . . , 6q), which is derived from the alignment of 
5 motif sites by calculating position-dependent fre- 
quencies of the four nucleotides. 

In order to understand how genes' mRNA expres- 
sion levels are regulated in the cell, it is crucial 
to identify TFBSs and to characterize TFBMs. Al- 
though much progress has been made in developing 
experimental techniques for identifying these TF- 
BSs, these techniques are typically expensive and 
time-consuming. They are also limited by experi- 
mental conditions, and cannot pinpoint the bind- 
ing sites exactly. In the past twenty years, compu- 
tational biologists and statisticians have developed 
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many successful in silico methods to aid biologists 
in finding TFBSs, and these efforts have contributed 
significantly to our understanding of transcription 
regulation. 

Likewise, motif discovery for protein sequences is 
important for identifying structurally or function- 
ally important regions (domains) and understand- 
ing proteins' functional components, or active sites. 
For example, using a Gibbs sampling-based motif 
finding algorithm, Lawrence et al. (1993) was able 
to predict the key helix-turn-helix motif among a 
family of transcription activators. Experimental ap- 
proaches for determining protein motifs are even 
more expensive and slower than those for DNAs, 
whereas computational approaches are more effec- 
tive than those for TFBSs predictions. 

The underlying logic of computational motif disco- 
very is to find patterns that are "enriched" in a given 
set of sequence data. Common methods include word 
enumeration (Sinha and Tompa (2002); Hampson, 
Kibler and Baldi (2002); Pavesi et al. (2004)), po- 
sition-specific frequency matrix updating (Stormo 
and Hartzell (1989); Lawrence and Reilly (1990); 
Lawrence et al. (1993)) or a combination of the two 
(Liu, Brutlag and Liu, 2002). The word enumera- 
tion approach uses a specific consensus word to rep- 
resent a motif. In contrast, the position-specific fre- 
quency matrix approach formulates a motif as a 
weight matrix. Jensen et al. (2004) provide a re- 
view of these motif discovery methods. Tompa et al. 
(2005) compared the performance of various motif 
discovery tools. Traditionally, researchers have em- 
ployed various heuristics, such as evaluating exces- 
siveness of word counts or maximizing certain in- 
formation criteria to guide motif finding. The EM 
algorithm was introduced by Lawrence and Reilly 
(1990) to deal with the motif finding problem. 

As shown in Figure 1, suppose we are given a 
set of K sequences Y = (Yi, . . . , Y^-), where Y^ = 



Y 4 : 
Y5: 



CCTTATAACCTAGC 
ATCATGGTATCACTCGAGC 
CTGATCTGTAATAGCTT 
GCTCTATATAACGCTAGCCA 
TCTAGCATATAACCGTATCT 



w = 6 
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Fig. 1. Transcription factor binding sites and motifs. (A) 
Each of the five sequences contains a TFBS of length 6. The 
local alignment of these sites is shown in the gray box. (B) The 
frequency of the nucleotides outside of the gray box is shown 
as 60 ■ The frequency of the nucleotides in the ith column of 
the gray box is shown as 6i . 



(Xk,\i ■•■■> Yk,L k ) and Y^i takes values in an alphabet 
of d residues (d = 4 for DNA/RNA and 20 for pro- 
tein). The alphabet is denoted by R = (rx, . . . , r<f). 
Motif sites in this paper refer to a set of contiguous 
segments of the same length w (e.g., the marked 6- 
mers in Figure 1). This concept can be further gener- 
alized via a hidden Markov model to allow gaps and 
position deletions (see Section 3 for HMM discus- 
sions). The weight matrix, or Product- Multinomial 
motif model, was first introduced by Stormo and 
Hartzell (1989) and later formulated rigorously in 
Liu, Neuwald and Lawrence (1995). It assumes that, 
if Yj^i is the ith position of a motif site, it follows 
the multinomial distribution with the probability 
vector 6i = (9n, . . . , 9 id); we denote this model as 
PM(6i, . . . , 6 W ). If Y^j does not belong to any motif 
site, it is generated independently from the multino- 
mial distribution with parameter 6q = (#01 > • • • > God)- 

Let = (6q , 61, . . . ,6 W ). For sequence Y&, there 
are L' k = — w + 1 possible positions a motif site of 
length w may start. To represent the motif locations, 
we introduce the unobserved indicators Y = {Tf. 1 \ 
1 <k < K, 1 <l < L' k }, where T^i = 1 if a motif site 
starts at position I in sequence Y&, and = oth- 
erwise. As shown in Figure 1, it is straightforward to 
estimate if we know where the motif sites are. The 
motif location indicators V are the missing data that 
makes the EM framework a natural choice for this 
problem. For illustration, we further assume that 
there is exactly one motif site within each sequence 
and that its location in the sequence is uniformly 
distributed. This means that Y2i ^k,l = 1 f° r & H k 
and P(r k j = 1) = zr- 

Given Tj-j = 1, the probability of each observed 
sequence Y& is 



(1) p(Y fe |r M = i,0) = ^ (Bfc - l) 



h(Y fc ,, 



In this expression, = {Y/, ~ :j<lorj>l + w} 
is the set of letters of nonsite positions of Y^ . The 
counting function h(-) takes a set of letter symbols 
as input and outputs the column vector (m, . . . , n^) T , 
where rjj is the number of base type ri in the input 
set. We define the vector power function as ^ = 

jd a". 



Ilj=i @ij f° r i = 0, ■ ■ ■ ,w. Thus, the complete-data 



3 for i 

likelihood function is the product of equation (1) 
for k from 1 to K, that is, 

K L' k 

p(Y,r|0) ex 11 JJp(Y fc |r fcjl = i,0) rfc .' 
fe=i/=i 
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h(M«) 



i=l 



(0 

where Bp is the set of all nonsite bases, and Mp 
is the set of nucleotide bases at position i of the 
TFBSs given the indicators T. 

The MLE of © from the complete-data likelihood 
can be determined by simple counting, that is, 



h(Mg } ; 



and 



o 



h(B r ) 



The EM algorithm for this problem is quite intu- 
itive. In the E-step, one uses the current parame- 
ter values to compute the expected values of 
ft(Mp ) and /i(Br). More precisely, for sequence 
Yfc, we compute its likelihood of being generated 
from ©(*) conditional on each possible motif loca- 
tion T k , = 1, 



w Kl = P(Y k \T k>l = l,®V) 

0\ \^ Yk ^ ( \ h ft,i+»-i) 

To) '"{To 



h(Y fe ) 



Letting W k = Yli=i w k,u we then compute the ex- 
pected count vectors as 



E r|0(i)iY [h(M»)] = ^^^h(n )W _ 1 ), 
k=i i=i k 

S r|e(t ) Y [h(Br)] = h({F M : 1 < k < K, 1 < I < L k }) 

w 

-E^rieW.vIMM?)]. 

-i=i 

In the M-step, one simply computes 
^rieWvIMM?)] 



K L' k 
\- \- W k j 



(H-l) 



A' 



and 



6> 



(i+i) _ fi r|8"),Yl h ( B r)] 
Hk=i(Lk-w) 



It is necessary to start with a nonzero initial weight 
matrix 0^°^ so as to guarantee that P(Y k \T k j = 
1,0 W ) > for all I. At convergence the algorithm 
yields both the MLE and predictive probabili- 
ties for candidate TFBS locations, that is, P(T k j = 
1|0,Y). 

Cardon and Stormo (1992) generalized the above 
simple model to accommodate insertions of variable 



lengths in the middle of a binding site. To over- 
come the restriction that each sequence contains ex- 
actly one motif site, Bailey and Elkan (1994, 1995a, 
1995b) introduced a parameter po describing the 
prior probability for each sequence position to be the 
start of a motif site, and designed a modified EM al- 
gorithm called the Multiple EM for Motif Elicitation 
(MEME). Independently, Liu, Neuwald and Lawrence 
(1995) presented a full Bayesian framework and Gibbs 
sampling algorithm for this problem. Compared with 
the EM approach, the Markov chain Monte Carlo 
(MCMC)-based approach has the advantages of mak- 
ing more flexible moves during the iteration and in- 
corporating additional information such as motif lo- 
cation and orientation preference in the model. 

The generalizations in Bailey and Elkan (1994) 
and Liu, Neuwald and Lawrence (1995) assume that 
all overlapping subsequences of length w in the se- 
quence data set are from a finite mixture model. 
More precisely, each subsequence of length w is trea- 
ted as an independent sample from a mixture of 
PM(0 1 ,...,0 w ) and PM(0 O , . . . ,6 ) [independent 
Multinomial^) in all w positions]. The EM solu- 
tion of this mixture model formulation then leads 
to the MEME algorithm of Bailey and Elkan (1994). 
To deal with the situation that w may not be known 
precisely, MEME searches motifs of a range of dif- 
ferent widths separately, and then performs model 
selection by optimizing a heuristic function based 
on the maximum likelihood ratio test. Since its re- 
lease, MEME has been one of the most popular mo- 
tif discovery tools cited in the literature. The Google 
scholar search gives a count of 1397 citations as of 
August 30th, 2009. Although it is 15 years old, its 
performance is still comparable to many new algo- 
rithms (Tompa et al., 2005). 

3. MULTIPLE SEQUENCE ALIGNMENT 

Multiple sequence alignment (MSA) is an impor- 
tant tool for studying structures, functions and the 
evolution of proteins. Because different parts of a 
protein may have different functions, they are sub- 
ject to different selection pressures during evolution. 
Regions of greater functional or structural impor- 
tance are generally more conserved than other re- 
gions. Thus, a good alignment of protein sequences 
can yield important evidence about their functional 
and structural properties. 

Many heuristic methods have been proposed to sol- 
ve the MSA problem. A popular approach is the pro- 
gressive alignment method (Feng and Doolittle, 1987), 
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in which the MSA is built up by aligning the most 
closely related sequences first and then adding more 
distant sequences successively. Many alignment pro- 
grams are based on this strategy, such as MUL- 
TALIGN (Barton and Sternberg, 1987), MULTAL 
(Taylor, 1988) and, the most influential one, ClustalW 
(Thompson, Higgins and Gibson, 1994). Usually, a 
guide tree based on pairwise similarities between the 
protein sequences is constructed prior to the multi- 
ple alignment to determine the order for sequences 
to enter the alignment. Recently, a few new pro- 
gressive alignment algorithms with significantly im- 
proved alignment accuracies and speed have been 
proposed, including T-Coffee (Notredame, Higgins 
and Heringa (2000)), MAFFT (Katoh et al., 2005), 
PROBCONS (Do et al, 2005) and MUSCLE (Edgar, 
2004a, 2004b). They differ from previous approaches 
and each other mainly in the construction of the 
guide tree and in the objective function for judging 
the goodness of the alignment. Batzoglou (2005) and 
Wallace, Blackshields and Higgins (2005) reviewed 
these algorithms. 

An important breakthrough in solving the MSA 
problem is the introduction of a probabilistic gen- 
erative model, the profile hidden Markov model by 
Krogh et al. (1994). The profile HMM postulates 
that the N observed sequences are generated as in- 
dependent but indirect observations (emissions) from 
a Markov chain model illustrated in Figure 2. The 
underlying unobserved Markov chain consists of three 
types of states: match, insertion and deletion. Each 
match or insertion state emits a letter chosen from 
the alphabet R (size d = 20 for proteins) accord- 
ing to a multinomial distribution. The deletion state 
does not emit any letter, but makes the sequence 
generating process skip one or more match states. A 
multiple alignment of the N sequences is produced 
by aligning the letters that are emitted from the 
same match state. 

Let Ti denote the unobserved state path through 
which the ith sequence is generated from the profile 
HMM, and S the set of all states. Let denote the 
set of all global parameters of this model, including 
emission probabilities in match and insertion states 
ei r (I £ S, r 6 R), and transition probabilities among 
all hidden states t ab (a,b 6 S). The complete-data 
log-likelihood function can be written as 

iogP(Y,r|©) 

N 

= ^[iogP(Y 4 |r i ,0) + iogP(r i |©)] 

i=l 




Fig. 2. Profile hidden Markov model. A modified toy exam- 
ple is adopted from Eddy (1998). It shows the alignment of 
five sequences, each containing only three to five letters. The 
first position is enriched with Cysteine (C), the fourth posi- 
tion is enriched with Histidine (H), and the fifth position is 
enriched with Phenylalanine (F) and Tyrosine (Y). The third 
sequence has a deletion at the fourth position, and the fourth 
sequence has an insertion at the third position. This simplified 
model does not allow insertion and deletion states to follow 
each other. 



N 

£ 

i=i 



V M lr (Ti) log e lr 



lessen 



N ab (Ti) log t ab 

a,beS 



where M[ r (Ti) is the count of letter r in sequence 
Yj that is generated from state I according to Ti, 
and N ab (Ti) is the count of state transitions from a 
to b in the path Ti for sequence Yj. 

The E-step involves calculating the expected counts 
of emissions and transitions, that is, E[Mi r (Ti)\&^] 
and E[N ao (Ti)\®^], averaging over all possible gen- 
erating paths Ti. The Q-function is 



N 



Q(e|eW) = ££ 



=i r, 



p(r t ,Y,|0W; 

P(Y|©W) 



log(e lr )M lr (Ti 

26S,reR 

+ log(t ab )N ab (Ti] 

a,beS 



A brute-force enumeration of all paths is prohibitively 
expensive in computation. Fortunately, one can ap- 
ply a forward-backward dynamic programming tech- 
nique to compute the expectations for each sequence 
and then sum them all up. 

In the M-step, the emission and transition prob- 
abilities are updated as the ratio of the expected 
event occurrences (sufficient statistics) divided by 
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the total expected emission or transition events: 

e(m) = EiWYQ/PtYiieW)} 

lr " E i W/(Y i )/ J P(Y i |©W)} ' 

t (t+i) = Ea^(Y,)/P(Y,|0W)} 
ab ' r, i {na(Y i )/P(Y i \&V)} , 

where 

m lr (Y i )=^M lr (T i )P(T i ,Y i \®®), 

n a6 (Yi) = ^N^T^PiXi, Yi|e(*)), 

m;(Yj) = ^m ir (Yj), n a (Yj) = ^n a6 (Yj). 
r-eR 6gs 

This method is called the Baum-Welch algorithm 
(Baum et al., 1970), and is mathematically equiva- 
lent to the EM algorithm. Conditional on the MLE 
0, the best alignment path for each sequence can 
be found efficiently by the Viterbi algorithm (see 
Durbin et al., 1998, Chapter 5, for details). 

The profile HMM provides a rigorous statistical 
modeling and inference framework for the MSA prob- 
lem. It has also played a central role in advancing 
the understanding of protein families and domains. 
A protein family database, Pfam (Finn et al., 2006), 
has been built using profile HMM and has served 
as an essential source of data in the field of pro- 
tein structure and function research. Currently there 
are two popular software packages that use profile 
HMMs to detect remote protein homologies: HM- 
MER (Eddy, 1998) and SAM (Hughey and Krogh, 
1996; Karplus, Barrett and Hughey, 1999). Madera 
and Gough (2002) gave a comparison of these two 
packages. 

There are several challenges in fitting the profile 
HMM. First, the size of the model (the number of 
match, insertion and deletion states) needs to be 
determined before model fitting. It is common to 
begin fitting a profile HMM by setting the num- 
ber of match states equal to the average sequence 
length. Afterward, a strategy called "model surgery" 
(Krogh et al., 1994) can be applied to adjust the 
model size (by adding or removing a match state 
depending on whether an insertion or a deletion is 
used too often). Eddy (1998) used a maximum a pos- 
teriori (MAP) strategy to determine the model size 
in HMMER. In this method the number of match 
states is given a prior distribution, which is equiva- 
lent to adding a penalty term in the log-likelihood 
function. 



Second, the number of sequences is sometimes too 
small for parameter estimation. When calculating 
the conditional expectation of the sufficient statis- 
tics, which are counts of residues at each state and 
state transitions, there may not be enough data, re- 
sulting in zero counts which could make the esti- 
mation unstable. To avoid the occurrence of zero 
counts, pseudo-counts can be added. This is equiv- 
alent to using a Dirichlet prior for the multinomial 
parameters in a Bayesian formulation. 

Third, the assumption of sequence independence 
is often violated. Due to the underlying evolution- 
ary relationship (unknown), some of the sequences 
may share much higher mutual similarities than oth- 
ers. Therefore, treating all sequences as i.i.d. sam- 
ples may cause serious biases in parameter estima- 
tion. One possible solution is to give each sequence 
a weight according to its importance. For example, 
if two sequences are identical, it is reasonable to give 
each of them half the weight of other sequences. The 
weights can be easily integrated into the M-step of 
the EM algorithm to update the model parameters. 
For example, when a sequence has a weight of 0.5, 
all the emission and transition events contributed by 
this sequence will be counted by half. Many meth- 
ods have been proposed to assign weights to the se- 
quences (Durbin et al., 1998), but it is not clear how 
to set the weights in a principled way to best account 
for sequence dependency. 

Last, since the EM algorithm can only find lo- 
cal modes of the likelihood function, some stochas- 
tic perturbation can be introduced to help find bet- 
ter modes and improve the alignment. Starting from 
multiple random initial parameters is strongly rec- 
ommended. Krogh et al. (1994) combined simulated 
annealing into Baum-Welch and showed some im- 
provement. Baldi and Chauvin (1994) developed a 
generalized EM (GEM) algorithm using a gradient 
ascent calculation in an attempt to infer HMM pa- 
rameters in a smoother way. 

Despite many advantages of the profile HMM, it 
is no longer the mainstream MSA tool. A main rea- 
son is that the model has too many free parameters, 
which render the parameter estimation very unsta- 
ble when there are not enough sequences (fewer than 
50, say) in the alignment. In addition, the vanilla 
EM algorithm and its variations developed by early 
researchers for the MSA problem almost always con- 
verge to suboptimal alignments. Recently, Edlefsen 
(2009) have developed an ECM algorithm for MSA 
that appears to have much improved convergence 
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properties. It is also difficult for the profile HMM 
to incorporate other kinds of information, such as 
3D protein structure and guide tree. Some recent 
programs such as 3D-Coffee (O'Sullivan et al., 2004) 
and MAFFT are more flexible as they can incorpo- 
rate this information into the objective function and 
optimize it. We believe that the Monte Carlo-based 
Bayesian approaches, which can impose more model 
constraints (e.g., to capitalize on the "motif" con- 
cept) and make more flexible MCMC moves, might 
be a promising route to rescue profile HMM (see 
Liu, Neuwald and Lawrence, 1995; Neuwald and Liu, 
2004). 

4. COMPARATIVE GENOMICS 

A main goal of comparative genomics is to iden- 
tify and characterize functionally important regions 
in the genome of multiple species. An assumption 
underlying such studies is that, due to evolution- 
ary pressure, functional regions in the genome evolve 
much more slowly than most nonfunctional regions 
due to functional constraints (Wolfe, Sharp and Li, 
1989; Boffelli et al., 2003). Regions that evolve more 
slowly than the background are called evolutionarily 
conserved elements. 

Conservation analysis (comparing genomes of re- 
lated species) is a powerful tool for identifying func- 
tional elements such as protein/RNA coding regions 
and transcriptional regulatory elements. It begins 
with an alignment of multiple orthologous sequences 
(sequences evolved from the same common ances- 
tral sequence) and a conservation score for each col- 
umn of the alignment. The scores are calculated 
based on the likelihood that each column is located 
in a conserved element. The phylogenetic hidden 
Markov model (Phylo-HMM) was introduced to in- 
fer the conserved regions in the genome (Yang, 1995; 
Felsenstein and Churchill, 1996; Siepel et al., 2005). 
The statistical power of Phylo-HMM has been sys- 
tematically studied by Fan et al. (2007). Siepel et al. 
(2005) used the EM algorithm for estimating param- 
eters in Phylo-HMM. Their results, provided by the 
UCSC genome browser database (Karolchik et al., 
2003), are very influential in the computational bi- 
ology community. By August 2009, the paper of Sie- 
pel et al. (2005) had been cited 413 times according 
to the Web of Science database. 

As shown in Figure 3, the alignment modeled by 
Phylo-HMM can be seen as generated from two steps. 
First, a sequence of L sites is generated from a two- 
state HMM, with the hidden states being conserved 



or nonconserved sites. Second, a nucleotide is gener- 
ated for each site of the common ancestral sequence 
and evolved to the contemporary nucleotides along 
all branches of a phylogenetic tree independently ac- 
cording to the corresponding phylogenetic model. 

Let p, and v be the transition probabilities be- 
tween the two states, and let the phylogenetic mod- 
els for nonconserved and conserved states be ip n = 
(Q, it, r, (3) and ipc = (Qi 7r ,T, p(3) , respectively. Here 
7r is the emission probability vector of the four nu- 
cleotides (A, C, G and T) in the common ances- 
tral sequence xo; r is the tree topology of the cor- 
responding phylogeny; (3 is a vector of non-negative 
real numbers representing branch lengths of the tree, 
which are measured by the expected number of sub- 
stitutions per site. The difference between the two 
states is characterized by a scaling parameter p € 
[0,1) applied to the branch lengths of only the con- 
served state, which means fewer substitutions. The 
nucleotide substitution model considers a descen- 
dent nucleotide to have evolved from its ancestor by 
a continuous-time time-homogeneous Markov pro- 
cess with transition kernel Q, also called the sub- 
stitution rate matrix (Tavare, 1986). The transition 
kernels for all branches are assumed to be the same. 
Many parametric forms are available for the 4-by-4 
nucleotide substitution rate matrix Q, such as the 
Jukes-Cantor substitution matrix and the general 
time- reversible substitution matrix (Yang, 1997). The 
nucleotide transition probability matrix for a branch 
of length fa is . 

Siepel et al. (2005) assumed that the tree topol- 
ogy r and the emission probability vector ir are 
known. In this case, the observed alignment Y = 
(yi-)y2-)y3->y4-) is a matrix of nucleotides. The pa- 
rameter of interest is = (//, v, Q, p, (3). The missing 
information F = (z, X) includes the state sequence z 
and the ancestral DNA sequences X. The complete- 
data likelihood is written as 

P(Y,r|0) 

L 

= b Zl P{y.\,yi.i\^ Zl )W^a Zi _ lZi P(y.i,-K.i\^ Zi ). 

i=2 

Here y.j is the ith column of the alignment Y, Zi £ 
{c, n} is the hidden state of the ith column, (b c , b n ) = 
(y^j) ^ s ^ e hhtial state probability of the HMM 
if the chain is stationary, and a Zi _ lZi is the transition 
probability (as illustrated in Figure 3). 

The EM algorithm is applied to obtain the MLE 
of 0. In the E-step, we calculate the expectation 
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Fig. 3. Two-state Phylo-HMM. (A) Phylogenetic tree: The 
tree shows the evolutionary relationship of four contem- 
porary sequences (y\. , ya. , y3- , y4-). They are evolved from 
the common ancestral sequence xo-, with two additional in- 
ternal nodes (ancestors), xi. and X2. . The branch lengths 
(3 — {Pq, f}\, P2, fa, Pa, j3s) indicate the evolutionary distance 
between two nodes, which are measured by the expected num- 
ber of substitutions per site. (B) HMM state-transition dia- 
gram: The system consists of a state for conserved sites and a 
state for nonconserved sites (c and n, respectively). The two 
states are associated with different phylogenetic models (1(1 c 
and tp n ), which differ by a scaling parameter p. (C) An illus- 
trative alignment generated by this model: A state sequence 
(z) is generated according to p and v. For each site in the 
state sequence, a nucleotide is generated for the root node in 
the phylogenetic tree and then for subsequent child nodes ac- 
cording to the phylogenetic model (ip c or ip n ). The observed 
alignment Y = (yi., y2- , ys- , y* ) is composed of all nucleotides 
in the leaf nodes. The state sequence z and all ancestral se- 
quences X = (xo. , xi. , X2.) are unobserved. 

of the complete-data log-likelihood under the distri- 
bution P(z,X|0 (t) ,Y). The marginalization of X, 
conditional on z and other variables, can be accom- 
plished efficiently site-by-site using the peeling or 
pruning algorithm for the phylogenetic tree (Felsen- 
stein (1981)). The marginalization of z can be done 
efficiently by the forward-backward procedure for 
HMM (Baum et al., 1970; Rabiner, 1989). For the 
M-step, we can use the Broyden-Fletcher-Goldfarb- 
Shanno (BFGS) quasi-Newton algorithm. After we 
obtain the MLE of 0, a forward-backward dynamic 
programming method (Liu, 2001) can then be used 
to compute the posterior probability that a given 
hidden state is conserved, that is, P(zi = c|0,Y), 
which is the desired conservation score. 



As shown in the Phylo-HMM example, the phylo- 
genetic tree model is key to integrating multiple se- 
quences for evolutionary analysis. This model is also 
used for comparing protein or RNA sequences. Due 
to its intuitive and efficient handling of the miss- 
ing evolutionary history, the EM algorithm has al- 
ways been a main approach for estimating parame- 
ters of the tree. For example, Felsenstein (1981) used 
the EM algorithm to estimate the branch length (3, 
Bruno (1996) and Holmes and Rubin (2002) used 
the EM algorithm to estimate the residue usage ir 
and the substitution rate matrix Q, Friedman et al. 
(2002) used an extension of the EM algorithm to es- 
timate the phylogenetic tree topology r, and Holmes 
(2005) used the EM algorithm for estimating inser- 
tion and deletion rates. Yang (1997) implemented 
some of the above algorithms in the phylogenetic 
analysis software PAML. A limitation of the Phylo- 
HMM model is the assumption of a good multiple 
sequence alignment, which is often not available. 

5. SNP HAPLOTYPE INFERENCE 

A Single Nucleotide Polymorphism (SNP) is a DNA 
sequence variation in which a single base is altered 
that occurs in at least 1% of the population. For 
example, the DNA fragments CCTGAGGAG and 
CCTGTGGAG from two homologous chromosomes 
(the paired chromosomes of the same individual, one 
from each parent) differ at a single locus. This ex- 
ample is actually a real SNP in the human /3-globin 
gene, and it is associated with the sickle-cell disease. 
The different forms (A and T in this example) of a 
SNP are called alleles. Most SNPs have only two al- 
leles in the population. Diploid organisms, such as 
humans, have two homologous copies of each chro- 
mosome. Thus, the genotype (i.e., the specific al- 
lelic makeup) of an individual may be AA, TT or 
AT in this example. A phenotype is a morphologi- 
cal feature of the organism controlled or affected by 
a genotype. Different genotypes may produce the 
same phenotype. In this example, individuals with 
genotype TT have a very high risk of the sickle-cell 
disease. A haplotype is a combination of alleles at 
multiple SNP loci that are transmitted together on 
the same chromosome. In other words, haplotypes 
are sets of phased genotypes. An example is given 
in Figure 4, which shows the genotypes of three in- 
dividuals at four SNP loci. For the first individual, 
the arrangement of its alleles on two chromosomes 
must be ACAC and ACGC, which are the haplo- 
types compatible with its observed genotype data. 
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One of the main tasks of genetic studies is to lo- 
cate genetic variants (mainly SNPs) that are associ- 
ated with inheritable diseases. If we know the hap- 
lotypes of all related individuals, it will be easier 
to rebuild the evolutionary history and locate the 
disease mutations. Unfortunately, the phase infor- 
mation needed to build haplotypes from genotype 
information is usually unavailable because labora- 
tory haplotyping methods, unlike geno typing tech- 
nologies, are expensive and low-throughput. 

The use of the EM algorithm has a long history in 
population genetics, some of which predates Demp- 
ster, Laird and Rubin (1977). For example, Cep- 
pellini, Siniscalco and Smith (1955) invented an EM 
algorithm to estimate allele frequencies when there 
is no one-to-one correspondence between phenotype 
and genotype; Smith (1957) used an EM algorithm 
to estimate the recombination frequency; and Ott 
(1979) used an EM algorithm to study genotype- 
phenotype relationships from pedigree data. Weeks 
and Lange (1989) reformulated these earlier appli- 
cations in the modern EM framework of Dempster, 
Laird and Rubin (1977). Most early works were single- 
SNP Association studies. Thompson (1984) and Lan- 
der and Green (1987) designed EM algorithms for 
joint linkage analysis of three or more SNPs. With 
the accumulation of SNP data, more and more re- 
searchers have come to realize the importance of 
haplotype analysis (Liu et al., 2001). Haplotype re- 
construction based on genotype data has therefore 
become a very important intermediate step in dis- 
ease association studies. 

The haplotype reconstruction problem is illustra- 
ted in Figure 4. Suppose we observed the genotype 
data Y = (Yi, . . . , Y n ) for n individuals, and we wish 
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Fig. 4. Haplotype reconstruction. We observed the genotypes 
of three individuals at 4 SNP loci. The 1st and 3rd individuals 
each have a unique haplotype phase, whereas the 2nd individ- 
ual has two compatible haplotype phases. We pool all possi- 
ble haplotypes together and associated with them a haplotype 
frequency vector (61, . . . ,6a). Each individual's two haplotypes 
are then assumed to be random draws (with replacement) from 
this pool of weighted haplotypes. 



to predict the corresponding haplotypes T = (Ti, . . . , 
r n ), where Tj = (Tf, T^) is the haplotype pair of the 
ith. individual. The haplotype pair Tj is said to be 
compatible with the genotype Yi , which is expressed 



as r? 



Yi, if the genotype Yi can be gener- 



ated from the haplotype pair. Let H = (Hi, . . . , H m ) 
be the pool of all distinct haplotypes and let = 
(6\, . . . ,6 m ) be the corresponding frequencies in the 
population. 

The first simple model considered in the litera- 
ture assumes that each individual's genotype vector 
is generated by two haplotypes from the pool cho- 
sen independently with probability vector 0. This 
is a very good model if the region spanned by the 
markers in consideration is sufficiently short that 
no recombination has occurred, and if mating in the 
population is random. Under this model, we have 



w»)=n e 



6j9k 

If r is known, we can directly write down the MLE 
of as 9j = Tgjk where the sufficient statistic nj is 
the number of occurrences of haplotype Hj in T. 
Therefore, in the EM framework, we simply replace 
rij by its expected value over the distribution of T 
when r is unobserved. More specifically, the EM 
algorithm is a simple iteration of 



2n 



where 0^ is the current estimate of the haplotype 
frequencies, and rij is the count of haplotypes Hj 
that exist in Y. 

The use of the EM algorithm for haplotype analy- 
sis has been coupled with the large-scale generation 
of SNP data. Early attempts include Excoffier and 
Slatkin (1995), Long, Williams and Urbanek (1995), 
Hawley and Kidd (1995) and Chiano and Clayton 
(1998). One problem of these traditional EM ap- 
proaches is that the computational complexity of the 
E-step grows exponentially as the number of SNPs 
in the haplotype increases. Qin, Niu and Liu (2002) 
incorporated a "partition-ligation" strategy into the 
EM algorithm in an effort to surpass this limitation. 
Lu, Niu and Liu (2003) used the EM for haplotype 
analysis in the scenario of case-control studies. Kang 
et al. (2004) extended the traditional EM haplotype 
inference algorithm by incorporating genotype un- 
certainty. Niu (2004) gave a review of general algo- 
rithms for haplotype reconstruction. 
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6. FINITE MIXTURE CLUSTERING FOR 
MICROARRAY DATA 

In cluster analysis one seeks to partition observed 
data into groups such that coherence within each 
group and separation between groups are maximized 
jointly. Although this goal is subjectively defined 
(depending on how one defines "coherence" and "sep- 
aration" ) , clustering can serve as an initial explorato- 
ry analysis for high-dimensional data. One example 
in computational biology is microarray data anal- 
ysis. Microarrays are used to measure the mRNA 
expression levels of thousands of genes at the same 
time. Microarray data are usually displayed as a 
matrix Y. The rows of Y represent the genes in 
a study and the columns are arrays obtained in dif- 
ferent experiment conditions, in different stages of 
a biological system or from different biological sam- 
ples. Cluster analysis of microarray data has been a 
hot research field because groups of genes that share 
similar expression patterns (clustering the rows of 
Y) are often involved in the same or related biolog- 
ical functions, and groups of samples having a sim- 
ilar gene expression profile (clustering the columns 
of Y) are often indicative of the relatedness of these 
samples (e.g., the same cancer type). 

Finite mixture models have long been used in clus- 
ter analysis (see Fraley and Raftery, 2002 for a re- 
view) . The observations are assumed to be generated 
from a finite mixture of distributions. The likelihood 
of a mixture model with K components can be writ- 
ten as 

n K 

P(Y|0i, ...,e K ;n,.. ., TK ) = \{Y, T kfk{^i\e k ), 

i=lfc=l 

where f k is the density function of the kth compo- 
nent in the mixture, Q k are the corresponding pa- 
rameters, and Tfc is the probability that an observed 
datum is generated from this component model (r k > 
0,^ fc Tfc = 1). One of the most commonly used fi- 
nite mixture models is the Gaussian mixture model, 
in which k is composed of mean and covari- 
ance matrix S^. Outliers can be accommodated by 
a special component in the mixture that allows for 
a larger variance or extreme values. 

A standard way to simplify the statistical compu- 
tation with mixture models is to introduce a variable 
indicating which component an observation Yj was 
generated from. Thus, the "complete data" can be 
expressed as Xj = (Y», Tj), where Tj = (jn , . . . , j iK ) , 



and 7j/u = 1 if Yj is generated by the feth compo- 
nent and 7^ = otherwise. The complete-data log- 
likelihood function is 

logP(Y,r\o u ...,o K ;Ti,...,T K ) 

n K 

= Y,12^og[r k f k (Y i \e i )}. 

i=i fc=i 

Since the complete-data log-likelihood function is 
linear in the jj k s, in the E-step we only need to 
compute 

- E(j ik \&U, Y) = f f $ Yil0 *\ t) • 
Ef^rf/^Y^f) 

The Q-function can be calculated as 

n K 

(2) Q(@\&^) = Y J Y,^og[T k f k (Y i \6 i )}. 

i=l fc=i 

The M-step updates the component probability r k 

as 

( t+ l) 1 - 
i=l 

and the updating of Q k would depend on the den- 
sity function. In mixture Gaussian models, the Q- 
function is quadratic in the mean vector and can be 
maximized to achieve the M-step. 

Yeung et al. (2001) are among the pioneers who 
applied the model-based clustering method in mi- 
croarray data analysis. They adopted the Gaussian 
mixture model framework and represented the co- 
variance matrix in terms of its eigenvalue decompo- 
sition 

S k = \ k D k A k Dl. 

In this way, the orientation, shape and volume of 
the multivariate normal distribution for each clus- 
ter can be modeled separately by eigenvector ma- 
trix D k , eigenvalue matrix A k and scalar X k , respec- 
tively. Simplified models are straightforward under 
this general model setting, such as setting A^, D k 
or A k to be identical for all clusters or restricting 
the covariance matrices to take some special forms 
(e.g., Sfc = \ k I). Yeung and colleagues used the EM 
algorithm to estimate the model parameters. To im- 
prove convergence, the EM algorithm can be initial- 
ized with a model-based hierarchical clustering step 
(Dasgupta and Raftery, 1998). 
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When Yj has some dimensions that are highly cor- 
related, it can be helpful to project the data onto a 
lower-dimensional subspace. For example, McLach- 
lan, Bean and Peel (2002) attempted to cluster tis- 
sue samples instead of genes. Each tissue sample is 
represented as a vector of length equal to the num- 
ber of genes, which can be up to several thousand. 
Factor analysis (Ghahramani and Hinton, 1997) can 
be used to reduce the dimensionality, and can be 
seen as a Gaussian model with a special constraint 
on the covariance matrix. In their study, McLach- 
lan, Bean and Peel used a mixture of factor ana- 
lyzers, equivalent to a mixture Gaussian model, but 
with fewer free parameters to estimate because of 
the constraints. A variant of the EM algorithm, the 
Alternating Expectation-Conditional Maximization 
(AECM) algorithm (Meng and van Dyk, 1997), was 
applied to fit this mixture model. 

Many microarray data sets are composed of sev- 
eral arrays in a series of time points so as to study 
biological system dynamics and regulatory networks 
(e.g., cell cycle studies). It is advantageous to model 
the gene expression profile by taking into account 
the smoothness of these time series. Ji et al. (2004) 
clustered the time course microarray data using a 
mixture of HMMs. Bar-Joseph et al. (2002) and Luan 
and Li (2003) implemented mixture models with 
spline components. The time-course expression data 
were treated as samples from a continuous smooth 
process. The coefficients of the spline bases can be 
either fixed effect, random effect or a mixture effect 
to accommodate different modeling needs. Ma et al. 
(2006) improved upon these methods by adding a 
gene-specific effect into the model: 

Uij — l^k^ij) &iji 

where //&(£) is the mean expression of cluster k at 
time t, composed of smoothing spline components; 
bi ~ N(0,a 2 k ) explains the gene specific deviation 
from the cluster mean; and ~ N(0,a 2 ) is the 
measurement error. The Q-function in this case is 
a weighted version of the penalized log- likelihood: 



(3) 
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,i=i 
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where the integral is the smoothness penalty term. 
A generalized cross-validation method was applied 
to choose the values for a 2 k and 



lik 



An interesting variation on the EM algorithm, the 
rejection-controlled EM (RCEM), was introduced in 
Ma et al. (2006) to reduce the computational com- 
plexity of the EM algorithm for mixture models. In 
all mixture models, the E-step computes the mem- 
bership probabilities (weights) for each gene to be- 
long to each cluster, and the M-step maximizes a 
weighted sum function as in Luan and Li (2003). 
To reduce the computational burden of the M-step, 
we can "throw away" some terms with very small 
weights in an unbiased weight using the rejection 
control method (Liu, Chen and Wong, 1998). More 
precisely, a threshold c (e.g., c = 0.05) is chosen. 
Then, the new weights are computed as 

max{7jfc,c}, with probability min{l,7jfc/c}, 
0, otherwise. 

The new weight 7^ then replaces the old weight 7^ 
in the Q-function calculation in (2) in general, and 
in (3) more specifically. For cluster k, genes with 
a membership probability higher than c are not af- 
fected, while the membership probabilities of other 
genes will be set to c or 0, with probabilities lik/c 
and 1 — lik/c, respectively. By giving a zero weight 
to many genes with low 7^ / c, the number of terms 
to be summed in the Q-function is greatly reduced. 

In many ways finite mixture models are similar 
to the K-means algorithm, and they may produce 
very similar clustering results. However, finite mix- 
ture models are more flexible in the sense that the 
inferred clusters do not necessarily have a sphere 
shape, and the shapes of the clusters can be learned 
from the data. Researchers such as Suresh, Dinakaran 
and Valarmathie (2009) tried to combine the two 
ways of thinking to make better clustering algo- 
rithms. 

For cluster analysis, one intriguing question is how 
to set the total number of clusters. Bayesian infor- 
mation criterion (BIC) is often used to determine 
the number of clusters (Yeung et al. (2001); Fra- 
ley and Raftery (2002); Ma et al. (2006)). A ran- 
dom subsampling approach is suggested by Dudoit, 
Fridlyand and Speed (2002) for the same purpose. 
When external information of genes or samples is 
available, cross-validation can be used to determine 
the number of clusters. 

7. TRENDS TOWARD INTEGRATION 

Biological systems are generally too complex to 
be fully characterized by a snapshot from a sin- 
gle viewpoint. Modern high-throughput experimen- 
tal techniques have been used to collect massive 
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amounts of data to interrogate biological systems 
from various angles and under diverse conditions. 
For instance, biologists have collected many types 
of genomic data, including microarray gene expres- 
sion data, genomic sequence data, ChlP-chip bind- 
ing data and protein-protein interaction data. Cou- 
pled with this trend, there is a growing interest in 
computational methods for integrating multiple sour- 
ces of information in an effort to gain a deeper un- 
derstanding of the biological systems and to over- 
come the limitations of divided approaches. For ex- 
ample, the Phylo-HMM in Section 4 takes as in- 
put an alignment of multiple sequences, which, as 
shown in Section 3, is a hard problem by itself. 
On the other hand, the construction of the align- 
ment can be improved a lot if we know the underly- 
ing phylogeny. It is therefore preferable to infer the 
multiple alignment and the phylogenetic tree jointly 
(Lunter et al., 2005). 

Hierarchical modeling is a principled way of inte- 
grating multiple data sets or multiple analysis steps. 
Because of the complexity of the problems, the in- 
clusion of nuisance parameters or missing data at 
some level of the hierarchical models is usually ei- 
ther structurally inevitable or conceptually prefer- 
able. The EM algorithm and Markov chain Monte 
Carlo algorithms are often the methods of choice for 
these models due to their close connection with the 
underlying statistical model and the missing data 
structure. 

For example, EM algorithms have been used to 
combine motif discovery with evolutionary informa- 
tion. The underlying logic is that the motif sites such 
as TFBSs evolved slower than the surrounding ge- 
nomic sequences (the background) because of func- 
tional constraints and natural selection. Moses, Chi- 
ang and Eisen (2004) developed EMnEM (Expecta- 
tion-Maximization on Evolutionary Mixtures) , which 
is a generalization of the mixture model formulation 
for motif discovery (Bailey and Elkan, 1994). More 
precisely, they treat an alignment of multiple orthol- 
ogous sequences as a series of alignments of length 
w, each of which is a sample from the mixture of a 
motif model and a background model. All observed 
sequences are assumed to evolve from a common an- 
cestor sequence according to an evolutionary process 
parameterized by a Jukes-Cantor substitution ma- 
trix. PhyME (Sinha, Blanchette and Tompa, 2004) 
is another EM approach for motif discovery in or- 
thologous sequences. Instead of modeling the com- 
mon ancestor, they modeled one designated "refer- 
ence species" using a two-state HMM (motif state 



or background state). Only the well-aligned part of 
the reference sequence was assumed to share a com- 
mon evolutionary origin with other species. PhyME 
assumes a symmetric star topology instead of a bi- 
nary phylogenetic tree for the evolutionary process. 
OrthoMEME (Prakash et al, 2004) deals with pairs 
of orthologous sequences and is a natural extension 
of the EM algorithm of Lawrence and Reilly (1990) 
described in Section 2. 

Steps have also been taken to incorporate micro- 
array gene expression data into motif discovery 
(Bussemaker, Li and Siggia (2001); Conlon et al. 
(2003)). Kundaje et al. (2005) used a graphical model 
and the EM algorithm to combine DNA sequence 
data with time-series expression data for gene clus- 
tering. Its basic logic is that co-regulated genes should 
show both similar TFBS occurrence in their up- 
stream sequences and similar gene-expression time- 
series curves. The graphical model assumes that the 
TFBS occurrence and gene-expression are indepen- 
dent, conditional on the co-regulation cluster assign- 
ment. Based on predicted TFBSs in promoter re- 
gions and cell-cycle time-series gene-expression data 
on budding yeast, this algorithm infers model pa- 
rameters by integrating out the latent variables for 
cluster assignment. In a similar setting, Chen and 
Blanchette (2007) used a Bayesian network and an 
EM-like algorithm to integrate TFBS information, 
TF expression data and target gene expression data 
for identifying the combination of motifs that are 
responsible for tissue-specific expression. The rela- 
tionships among different data are modeled by the 
connections of different nodes in the Bayesian net- 
work. Wang et al. (2005) used a mixture model to 
describe the joint probability of TFBS and target 
gene expression data. Using the EM algorithm, they 
provide a refined representation of the TFBS and 
calculate the probability that each gene is a true 
target. 

As we show in this review, the EM algorithm has 
enjoyed many applications in computational biology. 
This is partly driven by the need for complex sta- 
tistical models to describe biological knowledge and 
data. The missing data formulation of the EM algo- 
rithm addresses many computational biology prob- 
lems naturally. The efficiency of a specific EM al- 
gorithm depends on how efficiently we can integrate 
out unobserved variables (missing data/nuisance pa- 
rameters) in the E-step and how complex the opti- 
mization problem is in the M-step. Special depen- 
dence structures can often be imposed on the unob- 
served variables to greatly ease the computational 
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burden of the E-step. For example, the computa- 
tion is simple if latent variables are independent in 
the conditional posterior distribution, such as in the 
mixture motif example in Section 2 and the haplo- 
type example in Section 5. Efficient exact calculation 
may also be available for structured latent variables, 
such as the forward-backward procedure for HMMs 
(Baum et al., 1970), the pruning algorithm for phy- 
logenetic trees (Felsenstein, 1981) and the inside- 
outside algorithm for the probabilistic context-free 
grammar in predicting RNA secondary structures 
(Eddy and Durbin, 1994). As one of the drawbacks 
of the EM algorithm, the M-step can sometimes be 
too complicated to compute directly, such as in the 
Phylo-HMM example in Section 4 and the smooth- 
ing spline mixture model in Section 6, in which cases 
innovative numerical tricks are called for. 
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